First inspect Sections @ref(basicops) and @ref(vectorintro) (“Working with vectors”).
Calculate the following quantities:
1. The sum of 100.1, 234.9 and 12.01
100.1 + 234.9 + 12.01
## [1] 347.01
2. The square root of 256
sqrt(256)
## [1] 16
3. Calculate the base-10 logarithm of 100, and multiply the result by the cosine of \(\pi\). Hint: see ?log and ?pi.
log10(100)*cos(pi)
## [1] -2
4. Calculate the cumulative sum (‘running total’) of the numbers 2,3,4,5,6.
cumsum(c(2,3,4,5,6))
## [1] 2 5 9 14 20
5. Calculate the cumulative sum of those numbers, but in reverse order. Hint: use the rev function.
cumsum(rev(c(2,3,4,5,6)))
## [1] 6 11 15 18 20
6. Find 10 random numbers between 0 and 100, rounded to the nearest whole number (Hint: you can use either sample or a combination of round and runif).
# sample
sample(0:100,10)
## [1] 58 53 97 82 71 6 45 4 74 65
# runif and round
round(runif(10, 0, 100), 0)
## [1] 57 33 65 91 48 55 88 96 60 39
Type the following code, which assigns numbers to objects x and y.
x <- 10
y <- 20
1. Calculate the product of x and y
x * y
## [1] 200
2. Store the result in a new object called z
z <- x * y
z
## [1] 200
3. Inspect your workspace by typing ls(), and by clicking the Environment tab in RStudio, and find the three objects you created.
4. Make a vector of the objects x, y and z. Use this command,
myvec <- c(x,y,z)
5. Find the minimum, maximum, length, and variance of myvec.
min(myvec)
## [1] 10
max(myvec)
## [1] 200
length(myvec)
## [1] 3
var(myvec)
## [1] 11433.33
6. Remove the myvec object from your workspace.
rm(myvec)
If the last exercise was easy, you can skip this one.
1. The numbers below are the first ten days of rainfall amounts in 1996. Read them into a vector using the c() function.
0.1 0.6 33.8 1.9 9.6 4.3 33.7 0.3 0.0 0.1
rainfall <- c(0.1,0.6, 33.8, 1.9, 9.6, 4.3, 33.7, 0.3, 0.0, 0.1)
Inspect the data with functions from Section @ref(vectorintro) (“Working with vectors”), and answer the following questions:
2. Make a vector with the min, max and mean of rainfall. You can also name elements of a vector, for example c(x = 1, y = 2).
c(min = min(rainfall), max = max(rainfall), mean = mean(rainfall))
##   min   max  mean 
##  0.00 33.80  8.44
3. Calculate the cumulative rainfall (‘running total’) over these ten days. Confirm that the last value of the vector that this produces is equal to the total sum of the rainfall.
cumsum(rainfall)
## [1] 0.1 0.7 34.5 36.4 46.0 50.3 84.0 84.3 84.3 84.4
4. Which day saw the highest rainfall (write code to get the answer)?
which.max(rainfall)
## [1] 3
This exercise will make sure you are able to make a ‘reproducible script’, that is, a script that will allow you to repeat an analysis without having to start over from scratch. First, set up an R script, and save it in your current working directory.
1. Find the History tab in RStudio. Copy a few lines of history that you would like to keep to the script you just opened, by selecting the lines with the mouse and clicking To Source.
2. Tidy up your R script by writing a few comments starting with \#.
3. Now make sure your script works completely (that is, it is entirely reproducible). First clear the workspace (rm(list=ls()) or click Clear from the Environment tab). Then run the entire script (by clicking Source in the script window, top-right).
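Such a tidied, reproducible script might look like this (a sketch, reusing the rainfall example from the previous exercise):

```r
# --- Rainfall analysis: fully reproducible script ---
# Data: first ten days of rainfall in 1996
rainfall <- c(0.1, 0.6, 33.8, 1.9, 9.6, 4.3, 33.7, 0.3, 0.0, 0.1)
# Summary statistic
mean(rainfall)
# Running total over the ten days
cumsum(rainfall)
```

After clearing the workspace, clicking Source should run this top to bottom without errors.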
This short exercise points out the use of quotes in R.
1. Run the following code, which makes two numeric objects.
one <- 1
two <- 2
2. Run the following two lines of code, and look at the resulting two vectors. The first line makes a character vector, the second line a numeric vector by recalling the objects you just constructed. Make sure you understand the difference.
vector1 <- c("one","two")
vector2 <- c(one, two)
vector1 <- c("one","two")
vector2 <- c(one, two)
vector1
## [1] "one" "two"
vector2
## [1] 1 2
3. The following lines of code contain some common errors that prevent them from being evaluated properly or result in error messages. Look at the code without running it, and see if you can identify and correct all the errors. Then execute the faulty code by copying and pasting the text into the console (rather than typing it, because RStudio will attempt to avoid these errors by default), so you get to know some common error messages (note that not all of these result in errors!).
vector1 <- c('one', 'two', 'three', 'four, 'five', 'seven')
vec.var <- var(c(1, 3, 5, 3, 5, 1)
vec.mean <- mean(c(1, 3, 5, 3, 5, 1))
vec.Min <- Min(c(1, 3, 5, 3, 5, 1))
Vector2 <- c('a', 'b', 'f', 'g')
vector2
# vector1 <- c('one', 'two', 'three', 'four, 'five', 'seven')
# missing apostrophe after 'four'
vector1 <- c('one', 'two', 'three', 'four', 'five', 'seven')
# vec.var <- var(c(1, 3, 5, 3, 5, 1)
# missing closing parenthesis
# vec.mean <- mean(c(1, 3, 5, 3, 5, 1))
# (this line was already correct)
vec.var <- var(c(1, 3, 5, 3, 5, 1))
vec.mean <- mean(c(1, 3, 5, 3, 5, 1))
# vec.Min <- Min(c(1, 3, 5, 3, 5, 1))
# the 'min' function should have a lower-case 'm'
vec.Min <- min(c(1, 3, 5, 3, 5, 1))
# Vector2 <- c('a', 'b', 'f', 'g')
# vector2
# lower-case 'v' used here,
# upper-case 'V' used when defining variable in line above
vector2 <- c('a', 'b', 'f', 'g')
vector2
## [1] "a" "b" "f" "g"
First make sure you understand Section @ref(vectorized).
1. You have measured five cylinders, their lengths are:
2.1, 3.4, 2.5, 2.7, 2.9
and the diameters are:
0.3, 0.5, 0.6, 0.9, 1.1
Read these data into two vectors (give the vectors appropriate names).
lengths <- c(2.1, 3.4, 2.5, 2.7, 2.9)
diameters <- c(0.3, 0.5, 0.6, 0.9, 1.1)
2. Calculate the correlation between lengths and diameters (use the cor function).
cor(lengths, diameters)
## [1] 0.3282822
3. Calculate the volume of each cylinder (V = length * pi * (diameter / 2)^2).
# Calculate volumes and store in new vector
volumes <- lengths * pi * (diameters / 2)^2
# Look at the volumes
volumes
## [1] 0.1484403 0.6675884 0.7068583 1.7176658 2.7559622
4. Calculate the mean, standard deviation, and coefficient of variation of the volumes.
mean(volumes)
## [1] 1.199303
sd(volumes)
## [1] 1.039402
sd(volumes) / mean(volumes)
## [1] 0.8666714
5. Assume your measurements are in centimetres. Recalculate the volumes so that their units are in cubic millimetres. Calculate the mean, standard deviation, and coefficient of variation of these new volumes.
volumes.mm <- 10 * lengths * pi * (10 * diameters / 2)^2
mean(volumes.mm)
## [1] 1199.303
sd(volumes.mm)
## [1] 1039.402
sd(volumes.mm) / mean(volumes.mm)
## [1] 0.8666714
For the second question, you need to know that the 26 letters of the Roman alphabet are conveniently accessible in R via letters and LETTERS. These are not functions, but vectors that are always loaded.
1. Using c() and rep() make this vector (in one line of code):
"A" "A" "A" "B" "B" "B" "C" "C" "C" "D" "D" "D"
and this:
"A" "B" "C" "D" "A" "B" "C" "D" "A" "B" "C" "D"
If you are unsure, look at ?rep.
lets <- c("A", "B", "C", "D")
rep(lets, each = 3)
## [1] "A" "A" "A" "B" "B" "B" "C" "C" "C" "D" "D" "D"
rep(lets, times = 3)
## [1] "A" "B" "C" "D" "A" "B" "C" "D" "A" "B" "C" "D"
# The times argument can be omitted,
rep(lets, 3)
## [1] "A" "B" "C" "D" "A" "B" "C" "D" "A" "B" "C" "D"
2. Draw 10 random letters from the lowercase alphabet, and sort them alphabetically (Hint: use sample and sort). The solution can be one line of code.
# First inspect letters, it is a vector:
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
# Sample ten random letters (optionally, set replace=TRUE for sampling with replacement)
sample(letters, 10)
## [1] "e" "q" "l" "x" "o" "u" "i" "b" "d" "j"
# And, in one line of code, sort them alphabetically
sort(sample(letters, 10))
## [1] "f" "j" "l" "n" "o" "r" "t" "v" "w" "x"
3. Draw 5 random letters from each of the lowercase and uppercase alphabets, incorporating both into a single vector, and sort it alphabetically.
# If you like, you can store both sets of 5 letters in a vector, then combine:
low <- sample(letters, 5)
upp <- sample(LETTERS, 5)
# And then sort the combination of these vectors (use c() for that)
sort(c(low,upp))
## [1] "b" "B" "H" "k" "o" "t" "u" "U" "V" "W"
# All of this can be achieved in one line of code
sort(c(sample(letters, 5),sample(LETTERS, 5)))
## [1] "c" "C" "E" "F" "g" "H" "p" "t" "v" "Z"
4. Repeat the above exercise but sort the vector alphabetically in descending order.
# Inspect the help page ?sort, to find,
sort(c(sample(letters, 5),sample(LETTERS, 5)), decreasing = TRUE)
## [1] "U" "u" "t" "Q" "p" "O" "K" "k" "i" "F"
This question is harder! You will learn three new functions, and learn by experimenting.
Inspect the help page ?union, and note the useful functions union, setdiff and intersect. These can be used to compare and combine two vectors. Make two vectors:
x <- c(1,2,5,9,11)
y <- c(2,5,1,0,23)
Experiment with the three functions to find solutions to these questions.
1. Find values that are contained in both x and y
# First read the vectors
x <- c(1,2,5,9,11)
y <- c(2,5,1,0,23)
# Having read the help page, it seems we need to use intersect()
intersect(x,y)
## [1] 1 2 5
2. Find values that are in x but not y (and vice versa).
# Note difference between,
setdiff(x,y)
## [1] 9 11
# and
setdiff(y,x)
## [1] 0 23
3. Construct a vector that contains all values contained in either x or y, and compare this vector to c(x,y).
# union() finds values that are either in x or y,
union(x,y)
## [1] 1 2 5 9 11 0 23
# ... whereas c(x,y) simply concatenates (glues together) the two vectors
c(x,y)
## [1] 1 2 5 9 11 2 5 1 0 23
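As a side note, union gives the same set of values as unique applied to the concatenation; setequal (another function mentioned on the ?union help page) confirms this:

```r
setequal(union(x, y), unique(c(x, y)))
## [1] TRUE
```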
In this exercise you will practice some basic skills with matrices. Recall Section @ref(matrices).
1. Construct a matrix with 10 columns and 10 rows, all filled with random numbers between 0 and 1.
m <- matrix(runif(100), ncol=10)
2. Calculate the row means of this matrix (Hint: use rowMeans). Also calculate the standard deviation across the row means (now also use sd).
rowMeans(m)
## [1] 0.3649248 0.4067396 0.4231614 0.4785479 0.5719761 0.3893457 0.3256394
## [8] 0.5117232 0.4958670 0.5178928
sd(rowMeans(m))
## [1] 0.07829437
3. Now remake the above matrix with 100 columns, and 10 rows. Then calculate the column means (using, of course, colMeans), and plot a frequency diagram (a ‘histogram’) using hist. We will see this function in more detail in a later chapter, but it is easy enough to use as you just do hist(myvector), where myvector is any numeric vector (like the column means). What sort of shape of the histogram do you expect? Now repeat the above with more rows, and more columns.
# What shape do we expect? Perhaps something resembling a normal distribution?
m <- matrix(runif(100*10), ncol=100)
cm <- colMeans(m)
hist(cm)
# Yes, it does! This is the central limit theorem at work:
# the sum or mean of a set of random numbers approximately follows a normal distribution.
m <- matrix(runif(1000*100), ncol=1000)
cm <- colMeans(m)
hist(cm, breaks=20) # breaks argument to make more bins
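As a further check on the central limit theorem explanation above, the theoretical standard deviation of the mean of n independent Uniform(0,1) draws is sqrt(1/12)/sqrt(n). With 100 rows per column, sd(cm) should be close to:

```r
# Variance of Uniform(0,1) is 1/12; sd of a mean of 100 draws:
sqrt(1/12) / sqrt(100)
## [1] 0.02886751
```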
This exercise makes sure you know how to install packages, and load them. First, read the first subsections of @ref(packages), on installing and loading packages.
1. Install the car package (you only have to do this once for any computer).
install.packages("car")
2. Load the car package (you have to do this every time you open RStudio).
library(car)
3. Look at the help file for densityPlot.
?densityPlot
4. Run the example for densityPlot (at the bottom of the help file), by copy-pasting the example into a script, and then executing it.
# Scroll down in the help file, and simply copy-paste the code into the R
# console.
5. Run the example for densityPlot again, but this time use the example function:
example(densityPlot)
Follow the instructions to cycle through the different steps.
example(densityPlot)
6. Explore the contents of the car package by clicking first on the Packages tab, then finding the car package, and clicking on that. This way, you can find out about all functions a package contains (which, normally, you hope to avoid, but sometimes it is the only way to find what you are looking for). The same list can be obtained with the command library(help=car), but that displays a list that is not clickable, so probably not as useful.
Recall Exercise @ref(vecexerc1). Load the rainfall data once more.
We now practice subsetting a vector (see Section @ref(vectorindexing)).
1. Take a subset of the rainfall data where rain is larger than 20.
rainfall[rainfall > 20]
## [1] 33.8 33.7
2. What is the mean rainfall for days where the rainfall was at least 4?
mean(rainfall[rainfall >= 4])
## [1] 20.35
3. Subset the vector where it is either exactly zero, or exactly 0.6.
rainfall[rainfall == 0 | rainfall == 0.6]
## [1] 0.6 0.0
# Alternative solution,
rainfall[rainfall %in% c(0, 0.6)]
## [1] 0.6 0.0
The 26 letters of the Roman alphabet are conveniently accessible in R via letters and LETTERS. These are not functions, but vectors that are always loaded.
1. What is the 18th letter of the alphabet?
# You could have guessed,
LETTERS[18]
## [1] "R"
# or letters[18]
2. What is the last letter of the alphabet (pretend you don’t know the alphabet has 26 letters)?
# Use length() to index the vector,
letters[length(letters)]
## [1] "z"
# You could use your knowledge that the alphabet contains 26 letters, but the above
# solution is more general.
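An alternative uses tail, which returns the last elements of a vector:

```r
tail(letters, 1)
## [1] "z"
```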
3. Use ?sample to figure out how to sample with replacement. Generate a random subset of fifteen letters. Are any letters in the subset duplicated? Hint: use the any and duplicated functions. Which letters?
# 1. Random subset of 15 letters
let15 <- sample(letters, 15, replace = TRUE)
# 2. Any duplicated?
any(duplicated(let15))
## [1] TRUE
# 3. Which are duplicated?
# This tells us the indices of the letters that are duplicated,
which(duplicated(let15))
## [1] 7 10 12 15
# We can use these to index let15 and find the actual letters
let15[which(duplicated(let15))]
## [1] "f" "n" "w" "r"
First read or otherwise understand Section @ref(vecdataframes).
For this exercise, we will use the cereals dataset (from the lgrdata package, use data(cereals) to load it).
1. Read in the dataset, look at the first few rows with head and inspect the data types of the variables in the dataframe with str.
library(lgrdata)
data(cereals)
str(cereals)
## 'data.frame': 77 obs. of 13 variables:
## $ Cereal.name : chr "100%_Bran" "100%_Natural_Bran" "All-Bran" "All-Bran_with_Extra_Fiber" ...
## $ Manufacturer: Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
## $ Cold.or.Hot : Factor w/ 2 levels "C","H": 1 1 1 1 1 1 1 1 1 1 ...
## $ calories : int 70 120 70 50 110 110 110 130 90 90 ...
## $ protein : int 4 3 4 4 2 2 2 3 2 3 ...
## $ fat : int 1 5 1 0 2 2 0 2 1 0 ...
## $ sodium : int 130 15 260 140 200 180 125 210 200 210 ...
## $ fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
## $ carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
## $ sugars : int 6 8 5 0 8 10 14 8 6 5 ...
## $ potass : int 280 135 320 330 NA 70 30 100 125 190 ...
## $ vitamins : int 25 0 25 25 25 25 25 25 25 25 ...
## $ rating : num 68.4 34 59.4 93.7 34.4 ...
head(cereals)
## Cereal.name Manufacturer Cold.or.Hot calories protein fat
## 1 100%_Bran N C 70 4 1
## 2 100%_Natural_Bran Q C 120 3 5
## 3 All-Bran K C 70 4 1
## 4 All-Bran_with_Extra_Fiber K C 50 4 0
## 5 Almond_Delight R C 110 2 2
## 6 Apple_Cinnamon_Cheerios G C 110 2 2
## sodium fiber carbo sugars potass vitamins rating
## 1 130 10.0 5.0 6 280 25 68.40297
## 2 15 2.0 8.0 8 135 0 33.98368
## 3 260 9.0 7.0 5 320 25 59.42551
## 4 140 14.0 8.0 0 330 25 93.70491
## 5 200 1.0 14.0 8 NA 25 34.38484
## 6 180 1.5 10.5 10 70 25 29.50954
2. Add a new variable to the dataset called totalcarb, which is the sum of carbo and sugars. You can do this in (at least) three ways! (One of which is mutate from dplyr - a good idea to start using this powerful function).
# Option 1.
cereals$totalcarb <- cereals$carbo + cereals$sugars
# Alternatively:
cereals$totalcarb <- with(cereals, carbo + sugars)
# Alternatively, with dplyr:
library(dplyr)
cereals <- mutate(cereals,
totalcarb = carbo + sugars)
3. How many cereals in the dataframe are ‘hot’ cereals? Either make a subset and count the rows, or subset the Cold.or.Hot vector directly, and determine its length (do not use table yet, or a similar function!).
# Solution 1 : using subset on the dataframe
hotcereals <- subset(cereals, Cold.or.Hot == "H")
# Or (faster for large datasets):
library(dplyr)
hotcereals <- filter(cereals, Cold.or.Hot == "H")
# how many rows?
nrow(hotcereals)
## [1] 3
# Solution 2 : indexing the vector 'Cold.or.Hot'
hot <- cereals$Cold.or.Hot[cereals$Cold.or.Hot == "H"]
length(hot)
## [1] 3
4. How many unique manufacturers are included in the dataset? Hint: use length and unique.
length(unique(cereals$Manufacturer))
## [1] 7
5. Now try the n_distinct function from dplyr.
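A possible one-liner; n_distinct counts the number of unique values directly:

```r
library(dplyr)
n_distinct(cereals$Manufacturer)
## [1] 7
```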
6. Take a subset of the dataframe of all cereals that have less than 80 calories, AND have more than 20 units of vitamins.
subset(cereals, calories < 80 & vitamins > 20)
## Cereal.name Manufacturer Cold.or.Hot calories protein fat
## 1 100%_Bran N C 70 4 1
## 3 All-Bran K C 70 4 1
## 4 All-Bran_with_Extra_Fiber K C 50 4 0
## sodium fiber carbo sugars potass vitamins rating totalcarb
## 1 130 10 5 6 280 25 68.40297 11
## 3 260 9 7 5 320 25 59.42551 12
## 4 140 14 8 0 330 25 93.70491 8
# OR
# Note how with dplyr we can use a comma to separate AND (&)
# statements. It is also much faster for larger datasets.
dplyr::filter(cereals,
calories < 80,
vitamins > 20)
## Cereal.name Manufacturer Cold.or.Hot calories protein fat
## 1 100%_Bran N C 70 4 1
## 2 All-Bran K C 70 4 1
## 3 All-Bran_with_Extra_Fiber K C 50 4 0
## sodium fiber carbo sugars potass vitamins rating totalcarb
## 1 130 10 5 6 280 25 68.40297 11
## 2 260 9 7 5 320 25 59.42551 12
## 3 140 14 8 0 330 25 93.70491 8
7. Take a subset of the dataframe containing cereals that contain at least 1 unit of sugar, and keep only the variables ‘Cereal.name’, ‘calories’ and ‘vitamins’. Then inspect the first few rows of the dataframe with head.
# Using 'subset' (base R)
cereal_subs <- subset(cereals,
sugars >= 1,
select=c(Cereal.name, calories, vitamins))
# dplyr:
library(dplyr)
cereal_subs <- filter(cereals,
sugars >= 1) %>%
dplyr::select(Cereal.name, calories, vitamins)
# You can also use square bracket indexing.
# In practice you will rarely need this, but it is good to understand
# how it works:
cereal_subs <- cereals[cereals$sugars >= 1, c("Cereal.name","calories","vitamins")]
# In the above, '>=' means 'larger than or equal to'.
# Look at first few rows of the new subset.
head(cereal_subs)
## Cereal.name calories vitamins
## 1 100%_Bran 70 25
## 2 100%_Natural_Bran 120 0
## 3 All-Bran 70 25
## 5 Almond_Delight 110 25
## 6 Apple_Cinnamon_Cheerios 110 25
## 7 Apple_Jacks 110 25
8. For one of the above subsets, write a new CSV file to disk using write.csv (see Section @ref(exportingdata)).
# Use row.names=FALSE to avoid unnecessary row numbers in the CSV file.
write.csv(cereal_subs, "cerealsubset.csv", row.names=FALSE)
9. Rename the column ‘Manufacturer’ to ‘Producer’ (see Section @ref(namesdataframe)).
# First look at the names:
names(cereals)
# So, the second name is Manufacturer. Change it:
names(cereals)[2] <- "Producer"
# for dplyr, we need to know what the current name is:
cereals <- rename(cereals,
Producer = Manufacturer)
1. Read the following data into R (number of cuckoos seen in a week by an avid birdwatcher). Give the resulting dataframe a reasonable name. Hint: To read this dataset, look at Section @ref(datainscript) for a possibility (there are at least two ways to read this dataset, or you can type it into Excel and save as a CSV file if you prefer).
Day nrbirds
sunday 3
monday 2
tuesday 5
wednesday 0
thursday 8
friday 1
saturday 2
cuckoos <- read.csv(text="
Day, nrbirds
sunday, 3
monday, 2
tuesday, 5
wednesday, 0
thursday, 8
friday, 1
saturday, 2")
2. Add a day number to the dataset you read in above (sunday=1, saturday=7). Recall the seq function (Section @ref(sequences)).
cuckoos$daynumber <- 1:7
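The hint mentions seq; 1:7 works here, but a more general version does not hard-code the number of days:

```r
# Equivalent, and robust if the number of rows changes:
cuckoos$daynumber <- seq_len(nrow(cuckoos))
```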
3. Delete the ‘Day’ variable (to only keep the daynumber that you added above).
# Solution 1
cuckoos2 <- subset(cuckoos, select=-Day)
# Solution 2 (simply deletes the first column)
cuckoos2 <- cuckoos[,-1]
# Solution 3
cuckoos2 <- dplyr::select(cuckoos, -Day)
4. On which daynumber did you observe the most cuckoos? Hint: use which.max, in combination with indexing.
cuckoos$daynumber[which.max(cuckoos$nrbirds)]
## [1] 5
5. Sort the dataset by number of birds seen. Hint: use the order function to find the order of the number of birds, then use this vector to index the dataframe.
nrbird_order <- order(cuckoos$nrbirds)
cuckoos_sorted <- cuckoos[nrbird_order,]
1. Read the titanic data from the lgrdata package.
data(titanic)
2. Make two new dataframes: a subset of male survivors, and a subset of female survivors. Recall Section @ref(subsetdataframes). Use filter from dplyr.
# Always first inspect your data to see what it looks like
head(titanic)
## Name PClass Age Sex
## 1 Allen, Miss Elisabeth Walton 1st 29.00 female
## 2 Allison, Miss Helen Loraine 1st 2.00 female
## 3 Allison, Mr Hudson Joshua Creighton 1st 30.00 male
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.00 female
## 5 Allison, Master Hudson Trevor 1st 0.92 male
## 6 Anderson, Mr Harry 1st 47.00 male
## Survived
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
# Probably already loaded!
library(dplyr)
# Then take subsets
titanic_male <- filter(titanic,
Sex == "male",
Survived == 1)
titanic_female <- filter(titanic,
Sex == "female",
Survived == 1)
3. Based on the previous question, what was the name of the oldest surviving male? In what class was the youngest surviving female? Hint: use which.max, which.min on the subsets you just created.
The shortest solution uses vector indexing, but you can also solve this question in steps.
# Base R has the shortest solution:
titanic_male$Name[which.max(titanic_male$Age)]
## [1] Frolicher-Stehli, Mr Maxmillian
## 1310 Levels: Abbing, Mr Anthony ... Zimmerman, Leo
titanic_female$PClass[which.min(titanic_female$Age)]
## [1] 3rd
## Levels: 1st 2nd 3rd
4. Take 15 random names of passengers from the Titanic, and sort them alphabetically. Hint: use sort.
sample15 <- sample(titanic$Name, 15)
sort(sample15)
## [1] Candee, Mrs Edward (Helen Churchill Hungerford)
## [2] Davies, Mr Charles Henry
## [3] Dean, Master Bertram Vere
## [4] Hart, Mr Benjamin
## [5] Hegarty, Miss Nora
## [6] Jensen, Mr Niels Peder
## [7] Lyntakoff, Mr Stanko
## [8] Mallet, Master Andre
## [9] McCaffry, Mr Thomas Francis
## [10] McElroy, Mr Michael
## [11] Navratil, Mr Michel
## [12] Olsson, Mr Nils Johan
## [13] Pallas y Castello, Mr Emilio
## [14] Pasic, Mr Jakob
## [15] Saad, Mr Amin
## 1310 Levels: Abbing, Mr Anthony ... Zimmerman, Leo
1. Convert the ‘Name’ (passenger name) variable to a ‘character’ variable, and store it in the dataframe. See Section @ref(textgrep).
titanic$Name <- as.character(titanic$Name)
2. How many observations of ‘Age’ are missing from the dataframe? See examples in Section @ref(workingmissing).
# Look at summary:
summary(titanic$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.17 21.00 28.00 30.40 39.00 71.00 557
# Or count the number of missing values directly.
sum(is.na(titanic$Age))
## [1] 557
3. Make a new variable called ‘Status’, based on the ‘Survived’ variable already in the dataset. For passengers that did not survive, Status should be ‘dead’, for those who did, Status should be ‘alive’. Make sure this new variable is a factor. See the example with the ifelse function in Section @ref(workingfactors).
# ifelse returns a character vector here; wrap it in factor() to make sure Status is a factor.
titanic$Status <- factor(ifelse(titanic$Survived==0,"dead","alive"))
4. Count the number of passengers in each class (1st, 2nd, 3rd). Hint: use table as shown in Section @ref(workingfactors).
# The easiest way to count number of observations for each level of a factor is table()
table(titanic$PClass)
##
## 1st 2nd 3rd
## 322 280 711
5. Using grep, find the six passengers with the last name ‘Fortune’. Make this subset into a new dataframe. Did they all survive? Hint: to do this, make sure you recall how to use one vector to index a dataframe (see Section @ref(subsetdataframes)). Also, the all function might be useful here (see Section @ref(workinglogic)).
# Solution with 'grepl' (which returns a vector of TRUE/FALSE)
fortunepass <- subset(titanic, grepl("Fortune", Name))
# Did they *all* survive?
all(fortunepass$Survived == 1)
## [1] FALSE
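The question mentions grep rather than grepl; grep returns the indices of matching elements, which can be used to index the dataframe directly:

```r
# grep returns the row numbers where Name contains "Fortune"
fortune_rows <- grep("Fortune", titanic$Name)
fortunepass <- titanic[fortune_rows, ]
# Did they *all* survive?
all(fortunepass$Survived == 1)
## [1] FALSE
```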
6. As in 2., for what proportion of the passengers is the age unknown? Was this proportion higher for 3rd class than 1st and 2nd? Hint: First make a subset of the dataframe where age is missing (see Section @ref(naindataframe)), and then use table, as well as nrow.
# Solution 1 First subset the data to extract only the ones with missing
# Age. Then, count the number of observations by PClass
titanic_missage <- subset(titanic, is.na(Age))
table(titanic_missage$PClass)
##
## 1st 2nd 3rd
## 96 68 393
# Finally, divide by the total number of missing-age rows to get each class's
# share of the missing values:
table(titanic_missage$PClass)/nrow(titanic_missage)
##
## 1st 2nd 3rd
## 0.1723519 0.1220826 0.7055655
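The table above gives each class's share of the missing ages. To check whether the proportion of unknown ages was higher within 3rd class, divide the missing counts by the class sizes (both tables were computed earlier):

```r
# Proportion of passengers with unknown age, within each class
table(titanic_missage$PClass) / table(titanic$PClass)
# Using the counts shown above: roughly 0.30 (1st), 0.24 (2nd), 0.55 (3rd),
# so yes, the proportion is clearly highest in 3rd class.
```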
Use the hydro dam data (used in Section @ref(datesdataframe)).
1. Start by reading in the data. Change the first variable to a Date class (see Section @ref(readingdates)).
data(hydro)
# Note that the format of Date was D/M/Y.
library(lubridate)
hydro$Date <- dmy(hydro$Date)
2. Are the successive measurements in the dataset always exactly one week apart? Hint: use diff.
# diff() gives us the sequential differences, we can then list the unique values
# of these differences.
unique(diff(hydro$Date))
## [1] 7
3. Assume that a dangerously low level of the dam is 235 \(GWh\). How many weeks was the dam level equal to or lower than this value?
# Because the answer to the previous question was yes,
# we can just count the number of observations where storage was at most 235:
sum(hydro$storage <= 235)
## [1] 24
4. (Hard question). For question 3., how many times did storage decrease below 235 (regardless of how long it remained below 235)? Hint: use diff and subset.
# SOLUTION 1
# Take a subset of the data where storage < 235
hydrolow <- subset(hydro, storage < 235)
# Look at time difference between successive dates
diff(hydrolow$Date)
## Time differences in days
## [1] 7 7 7 7 7 7 7 7 7 7 7 7 35 7 7 7 7
## [18] 238 7 63 7 7 7
# whenever this time difference is larger than 7,
# the level has dipped below 235 again
# (plus one, because the hydrolow dataset starts below 235)
sum(diff(hydrolow$Date) > 7) + 1
## [1] 4
# SOLUTION 2
# Add variable that is 0 when storage < 235,
# 1 otherwise:
hydro$storageBinary <- ifelse(hydro$storage < 235, 0, 1)
# Now, diff gives -1 when storage dipped below 235:
diff(hydro$storageBinary)
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [24] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [47] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [70] 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0
## [93] 0 0 1 0 0 0 -1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [116] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0
## [139] 1 0 0 0 0 0 0 0 -1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [162] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [185] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [208] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [231] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [254] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [277] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [300] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# How many times did it dip below 235?
difs <- diff(hydro$storageBinary)
sum(difs == -1)
## [1] 4
Use the data for an experiment where trees were irrigated and/or fertilized (the hfeifplotmeans dataset).
1. Read the data, write a copy to a new object called trees (easier to type!) and look at various summaries of the dataset. Use summary, str and describe (the latter is in the Hmisc package).
data(hfeifplotmeans)
trees <- hfeifplotmeans
summary(trees)
str(trees)
# Using Hmisc:: instead of library(Hmisc) avoids attaching the package,
# which prevents some function name conflicts
Hmisc::describe(trees)
2. From these summaries, find out how many missing values there are for height and diameter. Also count the number of missing values as shown in Section @ref(workingmissing).
# is.na() gives a vector of TRUE/FALSE, which we can sum because TRUE is coded as 1,
# FALSE coded as 0.
sum(is.na(trees$height))
## [1] 48
sum(is.na(trees$diameter))
## [1] 48
3. Inspect the levels of the treatment (treat), with the levels function. Also count the number of levels with the nlevels function. Now assign new levels to the factor, replacing the abbreviations with a more informative label. Follow the example in Section @ref(changelevels).
# Inspect levels
levels(trees$treat)
## [1] "C" "F" "I" "IL"
# Count levels
nlevels(trees$treat)
## [1] 4
# Replace levels
levels(trees$treat) <- c("Control","Fertilized","Irrigated","Liquid fertilizer")
4. Using table, count the number of observations by treat, to check if the dataset is balanced. Be aware that table simply counts the number of rows, regardless of missing values. Now take a subset of the dataset where height is not missing, and check the number of observations again.
# Count by factor levels
table(trees$treat)
##
## Control Fertilized Irrigated Liquid fertilizer
## 80 80 80 80
# Take subset of non-missing data and try again
trees_nona <- subset(trees, !is.na(height))
table(trees_nona$treat)
##
## Control Fertilized Irrigated Liquid fertilizer
## 68 68 68 68
5. For which dates do missing values occur in height in this dataset? Hint: use a combination of is.na and unique.
# First make it a Date class:
library(lubridate)
trees$Date <- mdy(trees$Date)
# Then find unique Dates where height was NA:
unique(trees$Date[is.na(trees$height)])
## [1] "2007-10-01" "2007-11-01" "2007-12-01"
# Or, alternatively:
unique(trees[is.na(trees$height), "Date"])
## [1] "2007-10-01" "2007-11-01" "2007-12-01"
In this exercise, you will practice working with Dates and Date-Time combinations, with a timeseries dataset (the fluxtower dataset from the lgrdata package).
In this dataset, a new row was produced every 30 minutes, containing various meteorological measurements above a forest in Spain. For example, FCO2 is the flux of carbon dioxide out of the forest, so that negative values indicate photosynthesis.
1. Read the dataframe. Rename the first column to ‘DateTime’ (recall Section @ref(namesdataframe)).
data(fluxtower)
# Rename the first column to 'DateTime'.
names(fluxtower)[1] <- "DateTime"
2. Convert DateTime to a POSIXct class. Beware of the formatting (recall Section @ref(datetime)).
library(lubridate)
# Note the format was in D/M/Y H:M.
fluxtower$DateTime <- dmy_hm(fluxtower$DateTime)
3. Did the above action produce any missing values? Were these already missing in the original dataset?
# is.na returns a vector of TRUE/FALSE, any() will be TRUE if at least one of the
# supplied values is TRUE.
any(is.na(fluxtower$DateTime))
## [1] TRUE
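A follow-up check answers the second part of the question (a sketch; it assumes lgrdata lazy-loads its datasets, so an unmodified copy is available via the :: notation):

```r
# How many DateTime values are missing after the conversion?
sum(is.na(fluxtower$DateTime))
# Were the corresponding values already missing before conversion?
# Compare with a fresh, unmodified copy of the data:
sum(is.na(lgrdata::fluxtower[[1]]))
```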
4. Add a variable to the dataset called ‘Quality’. This variable should be ‘bad’ when the variable ‘ustar’ is less than 0.15, and ‘good’ otherwise. Recall the example in Section @ref(workingfactors).
fluxtower$Quality <- as.factor(ifelse(fluxtower$ustar < 0.15, "bad","good"))
5. Add a ‘month’ column to the dataset, as well as ‘year’.
# Abbreviated month : see ?month for more options
fluxtower$month <- month(fluxtower$DateTime, label=TRUE)
# Year
fluxtower$year <- year(fluxtower$DateTime)
6. Look at the ‘Rain’ column. There are some problems; re-read the data or find another way to display NA whenever the data have an invalid value. Hint: look at the argument na.strings in read.table.
Did it rain on this forest in Spain?
# str() shows us that rain is not a numeric variable as one might expect:
str(fluxtower$Rain)
## Factor w/ 2 levels "#DIV/0!","0": 2 2 2 2 2 2 2 2 2 2 ...
# Solution 1: re-read the raw data with na.strings = "#DIV/0!" (see ?read.table).
# Solution 2: set the bad values to NA, and convert the factor to numeric:
fluxtower$Rain[fluxtower$Rain == "#DIV/0!"] <- NA
fluxtower$Rain <- as.numeric(as.character(fluxtower$Rain))
DNA sequences can also be represented using text strings. In this exercise, you will make an artificial DNA sequence.
1. Make a random DNA sequence, consisting of 100 random selections of the letters C, A, G, T, and paste the result together into one character string (Hint: use sample as in Section @ref(randomnumbers) with replacement, and use paste as shown in Section @ref(workingtext)). Write it in one line of R code.
paste(sample(c("C","A","G","T"), 100, replace=TRUE), collapse="")
## [1] "TAAACCTGAATTTCTTTCAACAGCTGACCCTTGTCAAGTTTGCGTCCGGCGAATTCTTTCATTCGCCGTATGCTCTCCGGCCCCATTCGGTACACGAGGA"
Read the cereals dataset, and inspect it with str, summary, contents and describe (recall that the last two are in the Hmisc package). Interpret the results.
library(lgrdata)
data(cereals)
# structure
str(cereals)
# summary
summary(cereals)
# contents and describe are in the Hmisc package
Hmisc::contents(cereals)
Hmisc::describe(cereals)
Calculate the mean sodium, fiber and carbo content by Manufacturer. Use either summaryBy or dplyr.
# Use summaryBy() from the doBy package to get means of these three variables
# by every level of Manufacturer
library(doBy)
summaryBy(sodium + fiber + carbo ~ Manufacturer, data=cereals, FUN=mean)
## Manufacturer sodium.mean fiber.mean carbo.mean
## 1 A 0.0000 0.000000 16.00000
## 2 G 200.4545 1.272727 14.72727
## 3 K 174.7826 2.739130 15.13043
## 4 N 37.5000 4.000000 16.00000
## 5 P 146.1111 2.777778 13.22222
## 6 Q 92.5000 1.337500 NA
## 7 R 198.1250 1.875000 17.62500
Make a new factor sodiumClass, which is 'high' when sodium is greater than 150 and 'low' otherwise, and summarize the minimum, maximum and mean of sugars by this factor. Use na.rm=TRUE, because the dataset contains missing values.
# Step 1 : Values greater than 150 will be 'high', values less than 150 'low'
cereals$sodiumClass <- factor(ifelse(cereals$sodium > 150,"high","low"))
# Step 2: Summarize with summaryBy
library(doBy)
summaryBy(sugars ~ sodiumClass, data=cereals, FUN=c(min,max,mean), na.rm=TRUE)
## sodiumClass sugars.min sugars.max sugars.mean
## 1 high 1 14 7.066667
## 2 low 0 15 6.967742
# OR Step 2 : use dplyr.
library(dplyr)
group_by(cereals, sodiumClass) %>%
summarize(sugars_min = min(sugars, na.rm=TRUE),
sugars_max = max(sugars, na.rm=TRUE),
sugars_mean = mean(sugars, na.rm=TRUE))
## # A tibble: 2 x 4
## sodiumClass sugars_min sugars_max sugars_mean
## <fct> <int> <int> <dbl>
## 1 high 1 14 7.07
## 2 low 0 15 6.97
# If you want, do both in one step:
mutate(cereals,
sodiumClass = factor(ifelse(sodium > 150,"high","low"))) %>%
group_by(sodiumClass) %>%
summarize(sugars_min = min(sugars, na.rm=TRUE),
sugars_max = max(sugars, na.rm=TRUE),
sugars_mean = mean(sugars, na.rm=TRUE))
## # A tibble: 2 x 4
## sodiumClass sugars_min sugars_max sugars_mean
## <fct> <int> <int> <dbl>
## 1 high 1 14 7.07
## 2 low 0 15 6.97
# BUT notice how we do not add the column `sodiumClass` to cereals;
# we only use it to 'pipe' to group_by and summarize.
Calculate the maximum sugar content by sodiumClass and Manufacturer, using tapply. Inspect the result and notice there are missing values. Try to use na.rm=TRUE as an additional argument to tapply, only to find out that the values are still missing. Finally, use xtabs (see Section @ref(xtabs)) to count the number of observations by the two factors, to find out if we have missing values in the tapply result.
# using tapply (make a two-way table)
# We use na.rm=TRUE to remove missing values, this argument will be passed to max()
with(cereals, tapply(sugars, list(sodiumClass,Manufacturer), FUN=max, na.rm=TRUE))
## A G K N P Q R
## high NA 14 12 NA 14 12 8
## low 3 12 15 6 15 8 11
# using xtabs (count number of observations per group)
# the result shows that two manufacturers have no cereals in the 'high' sodium class
xtabs( ~ sodiumClass + Manufacturer, data=cereals)
## Manufacturer
## sodiumClass A G K N P Q R
## high 0 18 14 0 5 2 6
## low 1 4 9 6 4 6 2
Repeat the previous summary with summaryBy or dplyr. Compare the results.
# Results are the same as with tapply, except the result is a dataframe,
# not a two-way table.
library(doBy)
summaryBy(sugars ~ sodiumClass + Manufacturer, data=cereals, FUN=max, na.rm=TRUE)
## sodiumClass Manufacturer sugars.max
## 1 high G 14
## 2 high K 12
## 3 high P 14
## 4 high Q 12
## 5 high R 8
## 6 low A 3
## 7 low G 12
## 8 low K 15
## 9 low N 6
## 10 low P 15
## 11 low Q 8
## 12 low R 11
# With dplyr the results are again the same, except the result is a grouped
# tibble, not a two-way table.
group_by(cereals, sodiumClass, Manufacturer) %>%
summarize(sugars_max = max(sugars, na.rm=TRUE))
## # A tibble: 12 x 3
## # Groups: sodiumClass [2]
## sodiumClass Manufacturer sugars_max
## <fct> <fct> <int>
## 1 high G 14
## 2 high K 12
## 3 high P 14
## 4 high Q 12
## 5 high R 8
## 6 low A 3
## 7 low G 12
## 8 low K 15
## 9 low N 6
## 10 low P 15
## 11 low Q 8
## 12 low R 11
Count the number of cereals for each Manufacturer, and whether the cereal is served cold or hot, using xtabs (see Section @ref(xtabs)).
# base R
xtabs( ~ Cold.or.Hot + Manufacturer, data=cereals)
## Manufacturer
## Cold.or.Hot A G K N P Q R
## C 0 22 23 5 9 7 8
## H 1 0 0 1 0 1 0
# dplyr: tally() counts the observations in each group (here by Manufacturer only)
group_by(cereals, Manufacturer) %>%
tally()
## # A tibble: 7 x 2
## Manufacturer n
## <fct> <int>
## 1 A 1
## 2 G 22
## 3 K 23
## 4 N 6
## 5 P 9
## 6 Q 8
## 7 R 8
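To reproduce the full two-way count with dplyr, group by both factors; count() is shorthand for group_by() followed by tally() (a sketch, using the cereals data loaded above):

```r
library(dplyr)

# Count cereals by both Cold.or.Hot and Manufacturer, matching the xtabs result
# (in long format, with combinations of zero count omitted).
count(cereals, Cold.or.Hot, Manufacturer)
```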
Using the memory data (from the lgrdata package), find the mean and maximum number of words recalled by the 'Older' and 'Younger' age classes.
data(memory)
# Solution 1: use tapply twice
with(memory, tapply(Words, Age, mean))
## Older Younger
## 10.06 13.16
with(memory, tapply(Words, Age, max))
## Older Younger
## 23 22
# Solution 2 : use summaryBy
library(doBy)
summaryBy(Words ~ Age, data=memory, FUN=c(mean,max))
## Age Words.mean Words.max
## 1 Older 10.06 23
## 2 Younger 13.16 22
The hfemet2008 dataset contains meteorological measurements at a station near Sydney, Australia. Find the mean air temperature by month. To do this, first add the month variable as shown in Section @ref(datetime).
# Read the data.
data(hfemet2008)
# Inspect the last few rows, here you can usually tell the format of DateTime
tail(hfemet2008)
## DateTime Tair AirPress RH VPD PAR Rain wind
## 17563 12/31/2008 21:00 17.14 100.60 77.8 0.4356361 0 0 0.000
## 17564 12/31/2008 21:30 16.30 100.60 81.8 0.3385888 0 0 0.000
## 17565 12/31/2008 22:00 16.14 100.61 83.4 0.3056883 0 0 0.031
## 17566 12/31/2008 22:30 17.02 100.68 78.1 0.4264954 0 0 0.079
## 17567 12/31/2008 23:00 16.42 100.69 82.3 0.3318132 0 0 0.007
## 17568 12/31/2008 23:30 15.67 100.66 86.9 0.2340969 0 0 0.000
## winddirection
## 17563 66.89
## 17564 66.89
## 17565 94.60
## 17566 118.20
## 17567 136.30
## 17568 143.50
# In this case it looks like the format is month/day/year.
library(lubridate)
hfemet2008$DateTime <- mdy_hm(hfemet2008$DateTime)
# Add month (1,2,...,12)
# Note: this requires the lubridate package
hfemet2008$month <- month(hfemet2008$DateTime)
# average Tair by month:
with(hfemet2008, tapply(Tair, month, mean, na.rm=TRUE))
## 1 2 3 4 5 6 7
## 22.742312 20.463714 19.940297 15.251501 11.894728 12.517657 9.205703
## 8 9 10 11 12
## 9.522370 14.989961 18.062657 19.417347 21.650544
For this exercise, use the pupae dataset. The data contain measurements of larva ('pupae') weight and 'frass' (excrement) production while allowed to feed on leaves, grown under different concentrations of carbon dioxide (CO2). Also read this short dataset, which gives a label 'roomnumber' for each CO\(_2\) treatment. To read this dataset, consider the data.frame function described in Section @ref(vecstodfr).
data(pupae)
# A new dataframe with the CO2 levels and room number.
CO2room <- data.frame(CO2_treatment=c(280,400), Roomnumber=1:2)
# Merge the two dataframes.
pupae <- merge(pupae, CO2room)
# Inspect to see if the merge was successful.
head(pupae)
## CO2_treatment T_treatment Gender PupalWeight Frass Roomnumber
## 1 280 ambient 0 0.244 1.900 1
## 2 280 ambient 1 0.319 2.770 1
## 3 280 ambient 0 0.221 NA 1
## 4 280 ambient 0 0.280 1.996 1
## 5 280 ambient 0 0.257 1.069 1
## 6 280 ambient 1 0.333 2.257 1
Read Section @ref(dplyrjoin), and learn how to merge more than two datasets together.
First, run the following code to construct three dataframes that we will attempt to merge together.
dataset1 <- data.frame(unit=letters[1:9], treatment=rep(LETTERS[1:3],each=3),
Damage=runif(9,50,100))
unitweight <- data.frame(unit=letters[c(1,2,4,6,8,9)], Weight = rnorm(6,100,0.3))
treatlocation <- data.frame(treatment=LETTERS[1:3], Glasshouse=c("G1","G2","G3"))
Not all units have a measured Weight. Merge the datasets in two ways, to either include or exclude the units without Weight measurements. To do this, use either merge or dplyr's join functions twice - you cannot merge more than two datasets in one step.
# First merge the dataset and the treatlocation dataframe.
dtreat <- merge(dataset1, treatlocation, by="treatment")
# When all=TRUE, we get NA where some Weights are missing:
merge(dtreat, unitweight, all=TRUE, by="unit")
## unit treatment Damage Glasshouse Weight
## 1 a A 56.46458 G1 100.48264
## 2 b A 55.75776 G1 99.90312
## 3 c A 69.61792 G1 NA
## 4 d B 67.83915 G2 100.25292
## 5 e B 66.76343 G2 NA
## 6 f B 82.24302 G2 99.90837
## 7 g C 87.68079 G3 NA
## 8 h C 52.74762 G3 100.00242
## 9 i C 76.51946 G3 99.93381
# When all=FALSE, we omit rows where some Weights are missing:
merge(dtreat, unitweight, all=FALSE, by="unit")
## unit treatment Damage Glasshouse Weight
## 1 a A 56.46458 G1 100.48264
## 2 b A 55.75776 G1 99.90312
## 3 d B 67.83915 G2 100.25292
## 4 f B 82.24302 G2 99.90837
## 5 h C 52.74762 G3 100.00242
## 6 i C 76.51946 G3 99.93381
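The same two merges can be written with dplyr's join functions (a sketch, using the dataframes constructed above): left_join keeps all units, with NA where Weight is missing, while inner_join drops the units without a Weight measurement.

```r
library(dplyr)

# Keep all units; Weight is NA where it was not measured.
dataset1 %>%
  left_join(treatlocation, by = "treatment") %>%
  left_join(unitweight, by = "unit")

# Keep only units with a measured Weight.
dataset1 %>%
  left_join(treatlocation, by = "treatment") %>%
  inner_join(unitweight, by = "unit")
```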
Use the cereals data to learn how to make a boxplot, using R base graphics:
data(cereals)
# A simple boxplot showing the distribution of sodium for each level of Manufacturer.
boxplot(sodium ~ Manufacturer, data=cereals, ylab="Sodium content", xlab="Manufacturer")
Notice the ordering of the boxes from left to right, and compare it to levels of the factor variable Manufacturer.
Reorder the boxes by the mean sodium content per manufacturer (use reorder, see Section @ref(reorder)).
# Using reorder, the levels of Manufacturer are ordered by the mean
# of sodium for each level.
cereals$Manufacturer <- with(cereals, reorder(Manufacturer, sodium, mean))
# Note how the order is now increasing with sodium.
boxplot(sodium ~ Manufacturer, data=cereals, ylab="Sodium content", xlab="Manufacturer")
Read the boxplot help page (?boxplot), and change the boxplots so that the width varies with the number of observations per manufacturer (Hint: find the varwidth argument).
# With varwidth=TRUE, the width of the boxes is proportional to the sample size.
boxplot(sodium ~ Manufacturer, data=cereals, ylab="Sodium content", xlab="Manufacturer",
varwidth=TRUE)
For this exercise, refer to the tables and examples in Section @ref(distributions).
# pnorm gives the cumulative probability (i.e. prob. that value is less than some value).
pnorm(3.0, 5,2)
## [1] 0.1586553
# Because pnorm gives the probability that X is *less than* some value,
# we need the inverse.
# Solution 1 : use 'lower.tail=FALSE'
pnorm(4.5, 5,2, lower.tail=FALSE)
## [1] 0.5987063
# Solution 2 : calculate the complement (because total probability must be 1!)
1-pnorm(4.5,5,2)
## [1] 0.5987063
# qnorm is the opposite of pnorm. Given a probability, find K so that the probability
# of X is less than K is equal to that probability. Note that the default of qnorm
# is to find the probability *less than* some value. We therefore need the inverse.
# Because total probability is 1,
# P(X > K) = 0.05 is the same as P(X < K) = 1-0.05
qnorm(0.95, 5, 2)
## [1] 8.289707
# Or, use lower.tail=FALSE
qnorm(0.05, 5, 2, lower.tail=FALSE)
## [1] 8.289707
# dbinom finds the probability of 'x' occurrences (0 in this case) when we
# repeat N ('size') events (here, 10), each with probability 'prob' (here, 0.5).
dbinom(x = 0, size = 10, prob = 0.5)
## [1] 0.0009765625
# Same as before, but now K=5.
dbinom(x = 5, size = 10, prob = 0.5)
## [1] 0.2460938
# Here we can use the cumulative probability, but realizing that the default gives us
# the probability of *less than* the given number of events.
1-pbinom(q=7, size=10, prob=0.5)
## [1] 0.0546875
Simulate 100 values from a normal distribution with mean 100 and standard deviation 5, and store them in a vector x.
# rnorm simulates from a normal distribution.
# Here we store the results in vector x.
x <- rnorm(n=100, mean=100, sd=5)
Make a histogram of x; also try the boxplot function on your vector!
# This command puts the next two plots side-by-side
par(mfrow=c(1,2))
hist(x)
boxplot(x)
mean(x)
## [1] 98.86776
sd(x)
## [1] 4.941017
Test whether the mean of x differs significantly from 100, and from 90, using a one-sample t-test (see ?t.test). In science, it is customary (though debatable) to call an effect significant if the p-value is smaller than 0.05. Note that here we test whether the true mean is different from 100 - in this case we know the true mean, since the data were sampled from a normal distribution with mean 100. Bonus question: how often do we find a p-value smaller than 0.05 in this example, do you think? (That is, resampling the data from step 1, and then testing.)
t.test(x, mu=100)
##
## One Sample t-test
##
## data: x
## t = -2.2915, df = 99, p-value = 0.02405
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
## 97.88735 99.84816
## sample estimates:
## mean of x
## 98.86776
t.test(x, mu=90)
##
## One Sample t-test
##
## data: x
## t = 17.947, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 90
## 95 percent confidence interval:
## 97.88735 99.84816
## sample estimates:
## mean of x
## 98.86776
Repeat the above tests with the non-parametric Wilcoxon signed rank test.
wilcox.test(x, mu=100)
##
## Wilcoxon signed rank test with continuity correction
##
## data: x
## V = 1863, p-value = 0.02294
## alternative hypothesis: true location is not equal to 100
wilcox.test(x, mu=90)
##
## Wilcoxon signed rank test with continuity correction
##
## data: x
## V = 5027, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 90
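The bonus question can be answered by simulation (a sketch): resample the data many times, and record the p-value of the test against the true mean (100). By construction of the test, the p-value falls below 0.05 in roughly 5% of replicates - the false positive rate.

```r
# Repeat 1000 times: draw a new sample of 100 values, test against the
# true mean of 100, and keep the p-value.
pvals <- replicate(1000, t.test(rnorm(100, mean = 100, sd = 5), mu = 100)$p.value)

# Fraction of 'significant' results - should be close to 0.05.
mean(pvals < 0.05)
```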
For this question, use the pupae data.
Use the t.test function to compare PupalWeight by T_treatment.
data(pupae)
# When we use a formula in t.test, it will test the means of PupalWeight between the two
# levels of T_treatment. Note that this only works when the factor (on the right-hand side)
# contains exactly two levels.
t.test(PupalWeight ~ T_treatment, data=pupae,
var.equal=TRUE)
##
## Two Sample t-test
##
## data: PupalWeight by T_treatment
## t = 1.4385, df = 82, p-value = 0.1541
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.007715698 0.048012420
## sample estimates:
## mean in group ambient mean in group elevated
## 0.3222973 0.3021489
# The non-parametric alternative: a Wilcoxon rank sum test.
wilcox.test(PupalWeight ~ T_treatment, data=pupae)
## Warning in wilcox.test.default(x = c(0.244, 0.319, 0.221, 0.28, 0.257,
## 0.333, : cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: PupalWeight by T_treatment
## W = 1017.5, p-value = 0.1838
## alternative hypothesis: true location shift is not equal to 0
Run the following code to generate two paired samples x and y:
base <- rnorm(20, 20, 5)
x <- base + rnorm(20,0,0.5)
y <- base + rnorm(20,1,0.5)
Compare the means of x and y with a two-sample t-test; assume that the variance is equal for the two samples.
t.test(x,y, var.equal=TRUE)
##
## Two Sample t-test
##
## data: x and y
## t = -0.50195, df = 38, p-value = 0.6186
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.259418 2.566839
## sample estimates:
## mean of x mean of y
## 19.11583 19.96212
# Repeat the comparison with a paired t-test: the p-value is much smaller.
t.test(x,y, paired=TRUE)
##
## Paired t-test
##
## data: x and y
## t = -7.1328, df = 19, p-value = 8.816e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.0946202 -0.5979584
## sample estimates:
## mean of the differences
## -0.8462893
# The paired t-test is more appropriate, because x and y are not
# independent: they share the same 'base' values.
Continue with the pupae data. Perform a simple linear regression of Frass on PupalWeight. Produce and inspect the following:
model <- lm(Frass ~ PupalWeight, data = pupae)
summary(model)
##
## Call:
## lm(formula = Frass ~ PupalWeight, data = pupae)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77463 -0.21560 -0.01064 0.26259 0.89392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5046 0.1838 2.745 0.00746 **
## PupalWeight 4.2994 0.5773 7.448 9.1e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3332 on 81 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4065, Adjusted R-squared: 0.3991
## F-statistic: 55.47 on 1 and 81 DF, p-value: 9.1e-11
# To place side-by-side
par(mfrow=c(1,2))
# QQ plot and residual plot.
library(car)
residualPlot(model)
qqPlot(model)
## [1] 2 25
Repeat the regression on a subset of the data, where Gender is 0 and CO2_treatment is 400.
# We can pass a subset directly to lm(). Alternatively, make the subset first with subset().
plot(Frass ~ PupalWeight, data = pupae, subset=Gender==0 & CO2_treatment == 400)
model <- lm(Frass ~ PupalWeight, data = pupae, subset=Gender==0 & CO2_treatment == 400)
summary(model)
##
## Call:
## lm(formula = Frass ~ PupalWeight, data = pupae, subset = Gender ==
## 0 & CO2_treatment == 400)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26720 -0.08526 -0.01585 0.13171 0.28181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6751 0.1845 3.660 0.00156 **
## PupalWeight 4.1189 0.6430 6.405 3.01e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1657 on 20 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.6723, Adjusted R-squared: 0.6559
## F-statistic: 41.03 on 1 and 20 DF, p-value: 3.006e-06
par(mfrow=c(1,2))
library(car)
residualPlot(model)
qqPlot(model)
## 23 64
## 5 13
You have already used quantile-quantile (QQ) plots many times, but in this exercise you will get to the bottom of the idea of comparing quantiles of distributions.
As in the previous exercises, we will use the pupae data.
Extract PupalWeight and store it as a vector called 'pupweight'. Make a histogram of this vector (Section @ref(hist)), noticing that the distribution seems perhaps quite like the normal distribution.
pupweight <- pupae$PupalWeight
hist(pupweight)
Simulate a sample from a normal distribution using rnorm, with the mean and standard deviation of the pupal weights (i.e. pupweight), and the same sample size as well. Plot its histogram repeatedly to get an idea of whether the simulated histogram looks similar often enough.
hist(rnorm(length(pupweight), mean=mean(pupweight), sd=sd(pupweight)))
# Better yet, place them side-by-side:
par(mfrow=c(1,2))
hist(pupweight)
hist(rnorm(length(pupweight), mean=mean(pupweight), sd=sd(pupweight)))
When we calculate quantiles of pupweight, we are looking for the values below which 25% or 75% of all observations occur. Clearly, if two distributions have the same shape, their quantiles should be roughly similar. Calculate the 25, 50 and 75% quantiles for pupweight, and also calculate them for the normal distribution using qnorm. Are they similar?
# Some quantiles of the measured distribution
quantile(pupweight, c(0.25, 0.5, 0.75))
## 25% 50% 75%
## 0.25625 0.29750 0.35600
# Some quantiles of the normal distribution with the same mean and sd:
qnorm(c(0.25, 0.5, 0.75), mean=mean(pupweight), sd=sd(pupweight))
## [1] 0.2677620 0.3110238 0.3542856
# The values are very similar! We can conclude that, for these quantiles at
# least, our data behave as if they were drawn from a normal distribution.
Now compare many quantiles at once. Use seq to make the vector of quantile probabilities, and use it both in quantile and qnorm. Save the results of both as vectors, and plot them against each other. As a comparison, use qqPlot(pupweight, distribution="norm") (car package); make sure to plot the normal quantiles on the X-axis.
# Set up a vector of probabilities, used in calculating the quantiles.
qs <- seq(0.025, 0.975, by=0.025)
# Quantiles of the measured data
q_meas <- quantile(pupweight, probs=qs)
# Quantiles of the corresponding normal distribution
q_norm <- qnorm(qs, mean=mean(pupweight), sd=sd(pupweight))
# A simple 1:1 plot
plot(q_norm, q_meas)
abline(0,1)
# A standard QQ plot. Make sure to use the normal distribution because qqPlot uses
# the t-distribution by default.
# The two are not exactly the same because of the choice of the quantiles,
# but they are similar enough.
library(car)
qqPlot(pupweight, distribution="norm")
## [1] 29 37
Using the titanic data, use a one-way ANOVA to compare the average passenger age by passenger class. (Note: by default, lm will delete all observations where Age is missing.)
# Read data
data(titanic)
fit <- lm(Age~PClass, data=titanic)
summary(fit)
##
## Call:
## lm(formula = Age ~ PClass, data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.748 -7.300 -0.668 7.791 42.700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6678 0.8556 46.360 <2e-16 ***
## PClass2nd -11.3676 1.2299 -9.243 <2e-16 ***
## PClass3rd -14.4592 1.1191 -12.920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.86 on 753 degrees of freedom
## (557 observations deleted due to missingness)
## Multiple R-squared: 0.1884, Adjusted R-squared: 0.1862
## F-statistic: 87.38 on 2 and 753 DF, p-value: < 2.2e-16
From the memory data, take a subset of the Older subjects, and conduct a one-way ANOVA to compare words remembered by memory technique.
data(memory)
memOlder <- subset(memory, Age=="Older")
fit <- lm(Words~Process, data=memOlder)
summary(fit)
##
## Call:
## lm(formula = Words ~ Process, data = memOlder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.00 -1.85 -0.45 2.00 9.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0000 0.9835 11.184 1.39e-14 ***
## ProcessCounting -4.0000 1.3909 -2.876 0.00614 **
## ProcessImagery 2.4000 1.3909 1.725 0.09130 .
## ProcessIntentional 1.0000 1.3909 0.719 0.47589
## ProcessRhyming -4.1000 1.3909 -2.948 0.00506 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.11 on 45 degrees of freedom
## Multiple R-squared: 0.4468, Adjusted R-squared: 0.3976
## F-statistic: 9.085 on 4 and 45 DF, p-value: 1.815e-05
Fit a two-way ANOVA relating PupalWeight to Gender and CO2_treatment. Which main effects are significant? After reading in the pupae data, make sure to convert Gender and CO2_treatment to a factor first (see Section @ref(workingfactors)).
data(pupae)
pupae$CO2_treatment <- as.factor(pupae$CO2_treatment)
fit <- lm(PupalWeight ~ Gender+CO2_treatment, data=pupae)
summary(fit)
##
## Call:
## lm(formula = PupalWeight ~ Gender + CO2_treatment, data = pupae)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.126338 -0.028338 0.000162 0.022538 0.190663
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.26217 0.00886 29.589 < 2e-16 ***
## Gender 0.07817 0.01053 7.423 1.48e-10 ***
## CO2_treatment400 0.02017 0.01049 1.923 0.0583 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04623 on 75 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.4434, Adjusted R-squared: 0.4285
## F-statistic: 29.87 on 2 and 75 DF, p-value: 2.872e-10
library(car)
Anova(fit)
## Anova Table (Type II tests)
##
## Response: PupalWeight
## Sum Sq Df F value Pr(>F)
## Gender 0.117781 1 55.1016 1.484e-10 ***
## CO2_treatment 0.007902 1 3.6966 0.05832 .
## Residuals 0.160314 75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Is there an interaction between Gender and CO2_treatment?
# To test the interaction, add all terms (using *), and inspect the p-value
# for the interaction term in the ANOVA table.
fit <- lm(PupalWeight ~ Gender*CO2_treatment, data=pupae)
Anova(fit)
## Anova Table (Type II tests)
##
## Response: PupalWeight
## Sum Sq Df F value Pr(>F)
## Gender 0.117781 1 54.3883 1.955e-10 ***
## CO2_treatment 0.007902 1 3.6488 0.05998 .
## Gender:CO2_treatment 0.000063 1 0.0292 0.86489
## Residuals 0.160251 74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Repeat the above, using T_treatment instead of CO2_treatment.
fit <- lm(PupalWeight ~ Gender+T_treatment, data=pupae)
summary(fit)
##
## Call:
## lm(formula = PupalWeight ~ Gender + T_treatment, data = pupae)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.134744 -0.026092 -0.000259 0.023459 0.197226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.275774 0.009946 27.728 < 2e-16 ***
## Gender 0.078203 0.010836 7.217 3.63e-10 ***
## T_treatmentelevated -0.005233 0.010909 -0.480 0.633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04729 on 75 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.4177, Adjusted R-squared: 0.4022
## F-statistic: 26.9 on 2 and 75 DF, p-value: 1.556e-09
Anova(fit)
## Anova Table (Type II tests)
##
## Response: PupalWeight
## Sum Sq Df F value Pr(>F)
## Gender 0.116457 1 52.0824 3.635e-10 ***
## T_treatment 0.000515 1 0.2301 0.6328
## Residuals 0.167701 75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit2 <- lm(PupalWeight ~ Gender*T_treatment, data=pupae)
Anova(fit2)
## Anova Table (Type II tests)
##
## Response: PupalWeight
## Sum Sq Df F value Pr(>F)
## Gender 0.116457 1 51.4578 4.663e-10 ***
## T_treatment 0.000515 1 0.2274 0.6349
## Gender:T_treatment 0.000227 1 0.1005 0.7521
## Residuals 0.167474 74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the following exercises, use the pulse dataset. The pulse data contain measurements of heart rate ('pulse') for individuals before and after running (and some control individuals), as well as the individuals' height, weight and age (and other interesting variables). The Pulse1 variable is the resting heart rate before any treatment, and Pulse2 the heart rate after the treatment (half the subjects engaged in running).
Fit a multiple linear regression of Pulse1 against Age, Weight and Height (add the variables to the model in that order). Are any terms significant at the 5% level? What is the R2?
data(pulse)
# Fit model (without interactions)
pulse_fit <- lm(Pulse1 ~ Age+Weight+Height, data=pulse)
summary(pulse_fit)
##
## Call:
## lm(formula = Pulse1 ~ Age + Weight + Height, data = pulse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.838 -9.016 -0.115 6.497 70.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 123.39317 14.98976 8.232 5.42e-13 ***
## Age -0.45843 0.31739 -1.444 0.1516
## Weight -0.01985 0.10079 -0.197 0.8443
## Height -0.21542 0.09399 -2.292 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.82 on 105 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.09705, Adjusted R-squared: 0.07125
## F-statistic: 3.762 on 3 and 105 DF, p-value: 0.01304
library(car)
Anova(pulse_fit)
## Anova Table (Type II tests)
##
## Response: Pulse1
## Sum Sq Df F value Pr(>F)
## Age 342.6 1 2.0861 0.1516
## Weight 6.4 1 0.0388 0.8443
## Height 862.7 1 5.2532 0.0239 *
## Residuals 17244.0 105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Plot model diagnostics
par(mfrow=c(1,2))
residualPlot(pulse_fit)
qqPlot(pulse_fit)
## [1] 73 106
Now also include Exercise in the regression model (an indicator of whether the subject exercises frequently or not). You will need to first convert Exercise to a factor, as it is stored numerically in the data. Does adding Exercise improve the model?
pulse$Exercise <- as.factor(pulse$Exercise)
pulse_fit2 <- lm(Pulse1 ~ Age+Weight+Height+Exercise, data=pulse)
summary(pulse_fit2)
##
## Call:
## lm(formula = Pulse1 ~ Age + Weight + Height + Exercise, data = pulse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.564 -7.706 -0.548 6.543 70.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 114.500042 15.440027 7.416 3.53e-11 ***
## Age -0.551732 0.317033 -1.740 0.0848 .
## Weight 0.006992 0.101024 0.069 0.9450
## Height -0.199932 0.093220 -2.145 0.0343 *
## Exercise2 6.444701 3.812995 1.690 0.0940 .
## Exercise3 8.677150 4.105668 2.113 0.0370 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.67 on 103 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1347, Adjusted R-squared: 0.09267
## F-statistic: 3.206 on 5 and 103 DF, p-value: 0.009894
Anova(pulse_fit2)
## Anova Table (Type II tests)
##
## Response: Pulse1
## Sum Sq Df F value Pr(>F)
## Age 485.9 1 3.0286 0.08479 .
## Weight 0.8 1 0.0048 0.94496
## Height 738.0 1 4.5999 0.03432 *
## Exercise 718.5 2 2.2391 0.11171
## Residuals 16525.5 103
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# AIC shows only a very marginal improvement.
AIC(pulse_fit, pulse_fit2)
## df AIC
## pulse_fit 5 871.2904
## pulse_fit2 7 870.6514
Fit Pulse2 as a function of Pulse1 and Ran as main effects only (Note: convert Ran to a factor first). Use the visreg package to understand the fit.
pulse$Ran <- as.factor(pulse$Ran)
pulse_fit3 <- lm(Pulse2~Pulse1+Ran, data=pulse)
Anova(pulse_fit3)
## Anova Table (Type II tests)
##
## Response: Pulse2
## Sum Sq Df F value Pr(>F)
## Pulse1 15033 1 76.882 3.29e-14 ***
## Ran 72836 1 372.497 < 2.2e-16 ***
## Residuals 20727 106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# It is important that you visualize this model,
# since it has a numeric and a factor variable.
library(visreg)
visreg(pulse_fit3, "Pulse1", by="Ran", overlay = TRUE)
# It appears that the effect of running is a constant increase in heart rate -
# independent of the resting heart rate (since the lines are practically parallel).
Now add an interaction between Pulse1 and Ran. Is it significant? Also look at the effects plot: how is it different from the model without an interaction?
pulse_fit4 <- lm(Pulse2~Pulse1*Ran, data=pulse)
Anova(pulse_fit4)
## Anova Table (Type II tests)
##
## Response: Pulse2
## Sum Sq Df F value Pr(>F)
## Pulse1 15033 1 76.1605 4.337e-14 ***
## Ran 72836 1 369.0002 < 2.2e-16 ***
## Pulse1:Ran 1 1 0.0049 0.9442
## Residuals 20726 105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
visreg(pulse_fit4, "Pulse1", by="Ran", overlay = TRUE)
Add a new variable to the pulse dataset, as the difference between Pulse2 and Pulse1 (the change in heart rate after exercise). Now fit a linear regression model with all of these terms: "Height", "Weight", "Age", "Gender", "Smokes", "Alcohol", "Exercise", "Ran". As expected, Ran will be most important, but how do all the other variables rank in importance?
pulse$change_pulse <- pulse$Pulse2 - pulse$Pulse1
fit_pulse <- lm(change_pulse ~ Height + Weight + Age + Gender + Smokes + Alcohol + Exercise + Ran,
data=pulse)
library(relaimpo)
fit_pulse_importance <- calc.relimp(fit_pulse, type="lmg")
sort(fit_pulse_importance$lmg, TRUE)
## Ran Smokes Age Alcohol Height
## 0.7637838281 0.0061727722 0.0035606737 0.0025262570 0.0020345742
## Weight Exercise Gender
## 0.0019353579 0.0016129112 0.0005446059
Fit a logistic regression to find out how well Pulse2 can predict whether people were in the Ran group. Make sure that Ran is coded as a factor.
# This is a logistic regression, since Ran takes only two possible values.
fit6 <- glm(Ran ~ Pulse2, data=pulse, family=binomial)
summary(fit6)
##
## Call:
## glm(formula = Ran ~ Pulse2, family = binomial, data = pulse)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.98452 -0.05046 0.13816 0.32434 2.88698
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 14.70267 3.34818 4.391 1.13e-05 ***
## Pulse2 -0.15712 0.03828 -4.104 4.06e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 148.444 on 108 degrees of freedom
## Residual deviance: 44.847 on 107 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 48.847
##
## Number of Fisher Scoring iterations: 7
The visreg package is very helpful in visualizing the fitted model. For the logistic regression you just fit, run the following code and make sure you understand the output. (This code assumes you called the object fit6; if not, change fit6 to the name you used.)
# This plot shows the fitted logistic regression (with an approximate
# confidence interval).
# When pulse was low, the probability that the subject was in the 'Ran' group
# is very close to 1.
library(visreg)
visreg(fit6, scale="response")
Write a function that adds two numbers, and divides the result by two.
addtwo <- function(num1, num2){
(num1 + num2)/2
}
Recall the substr function. First, use that function to extract the first 2 characters of a bit of text. Then, write a function called firstTwoChars that extracts the first two characters of any bit of text.
firstTwoChars <- function(txt){
twoch <- substr(txt, 1, 2)
return(twoch)
}
Write a function that checks for missing values in a vector (use is.na and any). The function should return TRUE if there are missing values, and FALSE if not.
anymiss <- function(x) any(is.na(x))
which function). You can use message to write messages to the console.
anymiss <- function(x){
miss <- any(is.na(x))
if(miss){
message("The following observations are missing:")
print(which(is.na(x)))
}
return(miss)
}
readline can be used to ask for data to be typed in. First, figure out how to use readline by reading the corresponding help file. Then, construct a function called getAge that asks the user to type his/her age. (Hint: check the examples in the readline help page.)
getAge <- function(){
age <- readline("How old are you: ")
return(as.numeric(age))
}
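One caveat worth knowing (a note added here, not part of the original solution): as.numeric returns NA, with a warning, when the typed input cannot be interpreted as a number, so getAge will return NA for non-numeric answers.

```r
# Non-numeric input is converted to NA (with a warning)
as.numeric("abc")
## [1] NA
```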
calculateCI <- function(x, alpha=0.05){
xbar <- mean(x)
s <- sd(x)
n <- length(x)
half.width <- qt(1-alpha/2, n-1)*s/sqrt(n)
# Confidence Interval
CI <- c(xbar - half.width, xbar + half.width)
return(CI)
}
calculateCI(rnorm(100))
## [1] -0.2403044 0.1411257
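As a sanity check (a sketch added here, not part of the original solution): t.test reports the same t-based confidence interval, so calculateCI and t.test(x)$conf.int should agree for the same data.

```r
# t.test uses the same t-based half-width as calculateCI above,
# so the two intervals match for identical data.
set.seed(1)
x <- rnorm(100)
t.test(x, conf.level = 0.95)$conf.int
```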
head and tail. Write a function called middle that shows a few rows around (approx.) the ‘middle’ of the dataset. Hint: use nrow, print, and possibly floor.
# Solution 1 : shows n rows starting at the middle
middle <- function(x, n=10){
  m <- floor(nrow(x)/2)
  sub <- x[m:(m + n - 1),]
  print(sub)
}
# Solution 2 : shows n rows AROUND the middle
middle <- function(x, n=10){
  m <- floor(nrow(x)/2)
  start <- floor(m - n/2)
  end <- ceiling(m + n/2) - 1
  sub <- x[start:end,]
  print(sub)
}
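To check the indexing logic of Solution 2 by hand (a quick sketch, not part of the original solution): for a 32-row dataframe and n = 10, the selected row numbers are:

```r
m <- floor(32/2)              # middle row: 16
start <- floor(m - 10/2)      # 11
end <- ceiling(m + 10/2) - 1  # 20
start:end                     # exactly ten rows around the middle
##  [1] 11 12 13 14 15 16 17 18 19 20
```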
First read the following list:
veclist <- list(x=1:5, y=2:6, z=3:7)
sapply, check that all elements of the list are vectors of the same length. Also calculate the sum of each element.
sapply(veclist, length)
## x y z
## 5 5 5
all(sapply(veclist,length) == 5)
## [1] TRUE
sapply(veclist, sum)
## x y z
## 15 20 25
veclist$norms <- rnorm(10)
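After this addition the element lengths no longer match, which sapply makes easy to see (a sketch added here, restating the list so the chunk runs on its own):

```r
veclist <- list(x = 1:5, y = 2:6, z = 3:7)
veclist$norms <- rnorm(10)
# x, y and z have length 5; the new norms element has length 10
sapply(veclist, length)
```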
pupae data, use a \(t\)-test to find if PupalWeight varies with temperature treatment, separately for the two CO\(_2\) treatments (so, do two \(t\)-tests). You must use split and lapply.
data(pupae)
pupae_sp <- split(pupae, pupae$CO2_treatment)
lapply(pupae_sp, function(x)t.test(PupalWeight ~ T_treatment, data=x))
## $`280`
##
## Welch Two Sample t-test
##
## data: PupalWeight by T_treatment
## t = -0.81696, df = 31.012, p-value = 0.4202
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0521665 0.0223265
## sample estimates:
## mean in group ambient mean in group elevated
## 0.29000 0.30492
##
##
## $`400`
##
## Welch Two Sample t-test
##
## data: PupalWeight by T_treatment
## t = 2.1824, df = 42.886, p-value = 0.0346
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.003259507 0.082653536
## sample estimates:
## mean in group ambient mean in group elevated
## 0.3419565 0.2990000
coweeta data - a dataset with measurements of tree size. Split the data by species, to produce a list called coweeta_sp. Keep only those species that have at least 10 observations. (Hint: first count the number of observations per species, save that as a vector, find which are at least 10, and use that to subscript the list.) If you don’t know how to do this last step, skip it and continue to the next item.
data(coweeta)
# Make list of dataframes
coweeta_sp <- split(coweeta, coweeta$species)
# Keep only species with at least 10 observations
# SOLUTION 1.
nspec <- sapply(coweeta_sp, nrow) # count observations per species
morethan10 <- nspec > 9 # logical vector, TRUE when nspec > 9
# Use that to index the list (with single square bracket!!!)
coweeta_sp <- coweeta_sp[morethan10]
# SOLUTION 2.
nspec <- sapply(coweeta_sp,nrow)
# find names of species that have at least 10 observations
morethan10 <- names(nspec)[nspec > 9]
# Use that to subset the original data, and resplit
coweeta_subs <- droplevels(subset(coweeta, species %in% morethan10))
coweeta_sp <- split(coweeta_subs, coweeta_subs$species)
# SOLUTION 3 (dplyr)
library(dplyr)
coweeta_subs2 <- group_by(coweeta, species) %>%
filter(n() >= 10)
log10(biomass) on log10(height), separately by species. Use lapply.
lms <- lapply(coweeta_sp, function(x) lm(log10(biomass) ~ log10(height), data=x))
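A common follow-up is to pull a quantity out of each fitted model with sapply. This sketch uses simulated data (so it runs without the coweeta dataset), but the same pattern applies to the lms list above:

```r
# Simulate two groups with a known slope of 2, fit lm per group,
# then extract the slope from each fit with sapply.
set.seed(42)
d <- data.frame(g = rep(c("a", "b"), each = 20), x = runif(40, 1, 10))
d$y <- 2 * d$x + rnorm(40)
fits <- lapply(split(d, d$g), function(dd) lm(y ~ x, data = dd))
sapply(fits, function(m) coef(m)["x"])  # slopes, both close to 2
```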
First run this code to produce two vectors.
x <- rnorm(100)
y <- x + rnorm(100)
lmfit <- lm(y ~ x)
str(lmfit)
## List of 12
## $ coefficients : Named num [1:2] -0.16 0.909
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
## $ residuals : Named num [1:100] -0.785 -1.065 0.358 0.739 -0.105 ...
## ..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
## $ effects : Named num [1:100] 0.1976 9.1328 0.552 0.7473 -0.0943 ...
## ..- attr(*, "names")= chr [1:100] "(Intercept)" "x" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:100] -1.042 -0.236 -1.258 0.476 0.453 ...
## ..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:100, 1:2] -10 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:100] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "x"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.1 1.01
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 98
## $ xlevels : Named list()
## $ call : language lm(formula = y ~ x)
## $ terms :Classes 'terms', 'formula' language y ~ x
## .. ..- attr(*, "variables")= language list(y, x)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. ..$ : chr "x"
## .. ..- attr(*, "term.labels")= chr "x"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(y, x)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## $ model :'data.frame': 100 obs. of 2 variables:
## ..$ y: num [1:100] -1.828 -1.301 -0.9 1.215 0.348 ...
## ..$ x: num [1:100] -0.971 -0.084 -1.207 0.699 0.674 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ x
## .. .. ..- attr(*, "variables")= language list(y, x)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. .. ..$ : chr "x"
## .. .. ..- attr(*, "term.labels")= chr "x"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(y, x)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## - attr(*, "class")= chr "lm"
hist(lmfit$residuals)
lm object as an argument, and plots a histogram of the residuals.
# good:
lmResidHist <- function(lmobj){
hist(lmobj$residuals)
}
# better:
# You can now use any argument that hist() recognizes,
# they are passed through the use of '...'
lmResidHist <- function(lmobj,...){
hist(lmobj$residuals,...)
}
lmResidHist(lmfit, main="My Residuals")
Manufacturer has at least two observations (use table to find out which you want to keep first). Don’t forget to drop the empty factor level you may have created!
data(cereals)
tab <- table(cereals$Manufacturer)
morethan1 <- names(tab)[tab > 1]
cereals <- droplevels(subset(cereals, Manufacturer %in% morethan1))
cerealsp <- split(cereals, cereals$Manufacturer)
pdf("cereal plots.pdf", onefile=TRUE)
for(i in 1:length(cerealsp)){
with(cerealsp[[i]],
plot(potass, fiber, main=names(cerealsp)[i])
)
}
dev.off()
hfemet2008) that makes a scatter plot between PAR (a measure of light intensity) and VPD (a measure of the dryness of air).
# This function works for any dataframe, as long as it has
# columns named 'PAR' and 'VPD' in it.
PARVPD <- function(dat){
with(dat, plot(PAR, VPD))
}
data(hfemet2008)
library(lubridate)
hfemet2008$DateTime <- mdy_hm(as.character(hfemet2008$DateTime))
# extract month,
hfemet2008$month <- month(hfemet2008$DateTime)
hfesp <- split(hfemet2008, hfemet2008$month)
# windows(10,10) # optional - add this command if the window is too small
par(mfrow=c(3,4))
for(i in 1:12)PARVPD(hfesp[[i]])